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Abstract 

The  Rayleigh-Ritz  (RR)  procedure,  including  orthogonalization,  constitutes  a  major  bottleneck  in 
computing  relatively  high-dimensional  eigenspaces  of  large  sparse  matrices.  Although  operations  in¬ 
volved  in  RR  steps  can  be  parallelized  to  a  certain  level,  their  parallel  scalability,  which  is  limited  by 
some  inherent  sequential  steps,  is  lower  than  dense  matrix-matrix  multiplications.  The  primary  moti¬ 
vation  of  this  paper  is  to  develop  a  methodology  that  reduces  the  use  of  the  RR  procedure  in  exchange 
for  matrix-matrix  multiplications.  We  propose  an  unconstrained  penalty  model  and  establish  its  equiva¬ 
lence  to  the  eigenvalue  problem.  This  model  enables  us  to  deploy  gradient-type  algorithms  that  makes 
heavy  use  of  dense  matrix-matrix  multiplications.  Although  the  proposed  algorithm  does  not  necessarily 
reduce  the  total  number  of  arithmetic  operations,  it  leverages  highly  optimized  operations  on  modern 
high  performance  computers  to  achieve  parallel  scalability.  Numerical  results  based  on  a  preliminary 
implementation,  parallelized  using  OpenMR  show  that  our  approach  is  promising. 
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1  Introduction 

Eigenvalue  and  eigenvector  calculation  is  a  fundamental  computational  problem  with  extraordinarily  wide- 
ranging  applications.  In  the  past  several  decades,  a  great  deal  of  progress  has  been  made  in  the  development 

’Department  of  Mathematics,  MOE-LSC  and  Institute  of  Natural  Sciences,  Shanghai  Jiaotong  University,  CHINA 
(zw2109@sjtu.edu.cn).  Research  supported  in  part  by  NSFC  grant  11101274,  and  Humboldt  Research  Fellowship  for  Experi¬ 
enced  Researchers. 

' Computational  Research  Division,  Lawrence  Berkeley  National  Laboratory,  Berkeley,  UNITED  STATES  (cyang@lbl.gov). 
Support  for  this  work  was  provided  through  the  Scientific  Discovery  through  Advanced  Computing  (SciDAC)  program  funded  by 
U.S.  Department  of  Energy.  Office  of  Science,  Advanced  Scientific  Computing  Research  (and  Basic  Energy  Sciences)  under  award 
number  DE-SC0008666. 

*  State  Key  Laboratory  of  Scientific  and  Engineering  Computing,  Academy  of  Mathematics  and  Systems  Science,  Chinese 
Academy  of  Sciences,  CHINA  (liuxin@lsec.cc.ac.cn).  Research  supported  in  part  by  NSFC  grant  11101409  and  10831006,  and 
the  National  Center  for  Mathematics  and  Interdisciplinary  Sciences,  CAS. 

§  Department  of  Computational  and  Applied  Mathematics,  Rice  University,  UNITED  STATES  (yzhang@rice.edu).  Research 
supported  in  part  by  NSF  Grant  DMS-081 1 188,  ONR  Grant  N00014-08- 1-1 101,  and  NSF  Grant  DMS-1 1 15950. 


1 


Report  Documentation  Page 


Form  Approved 
OMB  No.  0704-0188 


Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 
VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 


1.  REPORT  DATE 

06  MAR  2013 


2.  REPORT  TYPE 


4.  TITLE  AND  SUBTITLE 

Trace-Penalty  Minimization  for  Large-scale  Eigenspace  Computation 


6.  AUTHOR(S) 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Rice  University, Department  of  Computational  and  Applied 
Mathematics, Houston, TX, 77005 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 


3.  DATES  COVERED 

00-00-2013  to  00-00-2013 

5a.  CONTRACT  NUMBER 

5b.  GRANT  NUMBER 

5c.  PROGRAM  ELEMENT  NUMBER 

5d.  PROJECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 

8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 


12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 

14.  ABSTRACT 

The  Rayleigh-Ritz  (RR)  procedure,  including  orthogonalization,  constitutes  a  major  bottleneck  in 
computing  relatively  high-dimensional  eigenspaces  of  large  sparse  matrices.  Although  operations  involved 
in  RR  steps  can  be  parallelized  to  a  certain  level,  their  parallel  scalability,  which  is  limited  by  some 
inherent  sequential  steps,  is  lower  than  dense  matrix-matrix  multiplications.  The  primary  motivation  of 
this  paper  is  to  develop  a  methodology  that  reduces  the  use  of  the  RR  procedure  in  exchange  for 
matrix-matrix  multiplications.  We  propose  an  unconstrained  penalty  model  and  establish  its  equivalence  to 
the  eigenvalue  problem.  This  model  enables  us  to  deploy  gradient-type  algorithms  that  makes  heavy  use  of 
dense  matrix-matrix  multiplications.  Although  the  proposed  algorithm  does  not  necessarily  reduce  the 
total  number  of  arithmetic  operations,  it  leverages  highly  optimized  operations  on  modern  high 
performance  computers  to  achieve  parallel  scalability.  Numerical  results  based  on  a  preliminary 
implementation,  parallelized  using  OpenMP,  show  that  our  approach  is  promising. 

15.  SUBJECT  TERMS 


16.  SECURITY  CLASSIFICATION  OF: 

17.  LIMITATION  OF 

18.  NUMBER 

19a.  NAME  OF 

ABSTRACT 

OF  PAGES 

RESPONSIBLE  PERSON 

a.  REPORT 

unclassified 

b.  ABSTRACT 

unclassified 

c.  THIS  PAGE 

unclassified 

Same  as 
Report  (SAR) 

30 

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


of  efficient  algorithms  and  solvers  for  various  types  of  eigenvalue  problems.  Iterative  methods  are  usually 
preferred  for  solving  large-scale  problems  because  of  their  ability  to  take  advantage  of  sparsity  or  other  struc¬ 
tures  existing  in  the  matrices  of  interest.  When  a  few  eigenpairs  arc  needed,  the  task  of  sparse  matrix- vector 
multiplications,  which  can  often  be  performed  efficiently  on  both  sequential  and  modern  parallel  comput¬ 
ers,  usually  constitutes  the  dominant  computational  cost.  However,  as  the  number  of  desired  eigenpairs 
increases,  the  computational  costs  in  an  iterative  eigensolver  can  shift  to  other  linear  algebra  operations. 

There  arc  two  types  of  operations  that  can  potentially  become  bottlenecks.  One  is  the  construction  and/or 
maintenance  of  orthonormal  bases  for  subspaces  from  which  approximate  eigenvalues  and  eigenvectors  arc 
extracted  at  each  iteration.  This  type  of  operations  is  often  carried  out  through  either  a  Gram-Schmidt 
(including  Arnoldi  or  Lanczos)  procedure  or  a  QR  factorization  at  the  complexity  of  at  least  ()(nk2)  where 
n  is  the  dimension  of  the  target  matrix  and  k  is  the  number  of  desired  eigenpairs.  Another  potentially 
high-cost  procedure  is  the  Rayleigh-Ritz  (RR)  calculation  [13]  used  to  extract  eigenvalue  and  eigenvector 
approximations  from  a  subspace  of  dimension  p  >  k.  The  RR  procedure  involves  solving  a  p-dimensional 
dense  eigenvalue  problem  and  assembling  the  so-called  Ritz  vectors  which  arc  approximate  eigenvectors  in 
the  original  space.  Because  the  Ritz  vectors  are  mutally  orthonormal,  the  RR  procedure  can  sometimes  be 
viewed  as  a  way  to  construct  an  orthonormal  basis  also.  The  complexity  for  the  RR  procedure  is  at  least 
0(nk2  +  k3).  When  the  number  k  is  small,  the  costs  of  these  two  types  of  operations  arc  minor  or  even 
negligible.  However,  when  k  increases  to  a  moderate  portion  of  the  matrix  dimension  n,  these  costs  can 
represent  a  significant,  even  dominant,  portion  of  the  overall  cost. 

The  use  of  parallel  computers  can  greatly  reduce  the  solution  time.  However,  to  make  efficient  use  of 
these  computers,  we  must  ensure  that  our  algorithm  is  scalable  with  respect  to  the  number  of  processors 
or  cores.  Although  the  standard  Krylov  subspace  iterative  algorithms  can  be  parallelized  through  the  par¬ 
allelization  of  the  sparse  matrix  vector  multiplications  (SpMV)  and  other  dense  linear  algebra  operations, 
the  amount  of  parallelism  is  limited  because  SpMVs  must  be  done  in  sequence  in  these  algorithms,  and 
each  SpMV  can  only  make  effective  use  of  a  limited  number  of  processing  units  in  general.  Block  methods, 
such  as  the  locally  optimal  block  preconditioned  conjugate  gradient  (LOBPCG)  algorithm  [10],  the  block 
Krylov-Schur  algorithm  [21]  and  the  Chebyshev-Davidson  algorithm  [19,  20],  arc  more  scalable  because 
more  concurrency  can  be  exploited  in  multiplying  a  sparse  matrix  with  a  block  of  vectors. 

However,  block  methods  have  so  far  not  addressed  the  relatively  high  cost  of  performing  an  RR  calcu¬ 
lation  at  each  iteration.  Although  parallel  algorithms  for  solving  the  dense  projected  eigenvalue  problem 
are  available  in  multi-thread  LAPACK  [2]  libraries  for  shared-memory  paralle  computers  and  in  the  ScaLA- 
PACK  [4]  library  for  distributed-memory  parallel  computers,  the  parallel  efficiency  of  these  algorithms  is 
often  limited  to  a  relatively  small  number  of  processors  or  cores.  When  a  large  number  of  processing  units 
are  involved,  the  thread  or  communication  overhead  can  be  significant.  One  way  to  address  this  issue  is  to 
use  a  “spectrum  slicing”  algorithm  [1,  5,  8]  that  divides  the  paid  of  the  spectrum  of  interest  into  a  number 
of  intervals  and  compute  eigenvalues  within  each  interval  in  parallel.  However,  this  approach  would  re¬ 
quire  computing  interior  eigenvalues  in  each  interval  which  is  generally  a  more  difficult  task.  Moreover,  a 
good  initial  guess  of  the  eigenvalues  of  interest  is  needed  so  that  the  spectrum  can  be  divided  in  an  efficient 
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manner  [1]. 

In  the  Chebyshev-Davidson  algorithm  [19,  20],  the  number  of  RR  steps  is  amortized  over  a  large  number 
of  SpMVs  because  a  Chebyshev  matrix  polynomial  filter  is  applied  to  a  block  of  vectors  before  an  RR 
calculation  is  performed  to  update  the  approximate  eigenvectors.  However,  an  apparent  drawback  of  this 
algorithm  is  the  difficulty  to  take  advantage  of  a  good  pre-conditioner  when  it  is  available.  In  addition,  it  is 
difficult  to  apply  a  Chebyshev  polynomial  filter  to  generalized  eigenvalue  problems. 

In  this  paper,  we  present  a  block  algorithm  for  computing  k  algebraically  smallest  eigenvalues  of  a  real 
symmetric  matrix  A  £  Mnxn  and  their  corresponding  eigenvectors,  though  the  same  methodology  can  easily 
be  applied  to  compute  the  largest  eigenvalues  and  to  compute  eigenvalues  of  complex  Hernritian  matrices. 
Our  approach  starts  from  the  trace  minimization  formulation  for  eigenvalue  problems.  It  is  well  known  that 
the  invariant  subspace  associated  with  a  set  of  k  algebraically  smallest  eigenvalues  of  A  yields  an  optimal 
solution  to  the  following  trace  minimization  problem  with  orthogonality  constraints 

min  tr {XTAX),  s.t.  =  /.  (1) 

xeRnXk 

A  major  theoretical  result  of  this  paper  is  to  establish  an  equivalence  relationship  between  problem  (1)  and 
the  following  unconstrained  optimization  problem 

min  f^X)  :=  Ur(XTAX)  +  ^\\XTX-I\\2F,  (2) 

xeR"xfc  2  4 

when  the  penalty  parameter  //  >  0  takes  suitable  finite  values.  As  is  well  recognized,  the  objective  function 
in  (2)  is  the  classic  quadratic  (or  Courant)  penalty  function  [6,  12,  15]  for  the  constrained  problem  (1). 
Generally  speaking,  the  classic  quadratic  penalty  model  approaches  the  original  constrained  problem  only 
as  the  penalty  parameter  /  t  goes  to  infinity.  However,  we  show  that  in  terms  of  finding  an  optimal  eigenspace 
problem  (2)  is  essentially  equivalent  to  (1)  when  the  penalty  parameter  //  is  appropriately  chosen. 

We  will  call  the  approach  of  solving  model  (2)  trace-penalty  minimization.  A  key  difference  between 
trace  minimization  and  trace-penalty  minimization  is  that  explicit  orthogonality  of  X  is  no  longer  required 
in  the  latter,  which  immediately  opens  up  the  possibility  of  doing  far-  fewer  RR  steps  including  far-  fewer 
orthogonalizations  and  other  RR-related  operations.  In  exchange,  as  will  be  demonstrated  later,  more  dense 
matrix-matrix  multiplications  are  performed  (to  a  less  extent,  also  more  SpMV).  A  major  potential  advantage 
of  replacing  RR  steps  by  dense  matrix-matrix  multiplications  is  that  the  latter  operations  have  much  better 
parallel  scalability  and  are  highly  optimized  for  modern  high  performance  computers.  In  addition,  one  could 
incorporate  pre-conditioning  into  trace-penalty  minimization  in  a  straightforward  manner. 

In  this  paper,  we  consider  applying  gradient-type  methods  to  the  trace-penalty  minimization  problem 
(2).  These  methods  can  often  quickly  reach  the  vicinity  of  an  optimal  solution  and  produce  a  moderately 
accurate  approximation.  In  many  applications,  rough  or  moderately  accurate  approximations  arc  often  suf¬ 
ficient.  One  of  such  instances  is  when  solving  a  nonlinear-  eigenvalue  problem,  one  approximately  solves  a 
sequence  of  linearized  eigenvalue  problems  one  after  another  (such  as  in  solving  the  Kohn-Sham  equation 
in  electronic  structure  calculation  by  “self-consistent  field”  iterations  [14]).  In  this  paper  we  evaluate  the 
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efficiency  of  our  algorithm  not  by  measuring  the  time  it  takes  to  compute  eigenpairs  to  a  high  accuracy  close 
to  machine  precision,  but  rather  the  time  it  takes  to  achieve  a  moderate  accuracy  in  computed  eigenpairs. 
Once  good  estimates  arc  at  hand,  there  exist  a  number  of  techniques  that  can  perform  further  refinements 
to  obtain  a  higher  accuracy.  For  example,  moderately  accurate  estimates  can  be  further  improved  by  using 
a  “spectral  slicing”  type  of  algorithm  [1,  5,  8],  Another  possibility  is  to  apply  polynomial  filtering  to  do 
refinements.  For  the  proposed  trace-penalty  minimization  problem  (2),  we  have  experimented  with  various 
algorithmic  options  in  Matlab,  developed  a  Fortran  implementation  and  parallelize  it  using  OpenMP.  Pre¬ 
liminary  numerical  comparison  with  some  of  the  existing  approaches  shows  that  our  approach  is  promising. 

The  rest  of  this  paper  is  organized  as  follows.  We  analyze  the  trace-penalty  minimization  model  in 
Section  2.  Our  algorithms  and  several  implementation  details  are  discussed  in  Section  3.  Numerical  results 
arc  reported  in  Section  4.  Finally,  we  conclude  the  paper  in  Section  5. 

2  Trace-Penalty  Minimization:  Model  Analysis 

For  a  given  real  symmetric  matrix  A  =  AT  E  Mnxn,  an  eigenvalue  decomposition  of  A  is  defined  as 

A  =  QnAnQn-,  (3) 

where,  for  any  integer  i  £  [1,  n], 

Qi  =  [<7i,  <72,  •  •  • ,  Qi]  €  Mnx\  A i  =  diag(Ai,  A2,  •  •  • ,  A*)  £  M*x*,  (4) 

so  that  QjQi  =  I  £  M!Xz  and  A,;  is  diagonal.  The  columns  q\.  ...  .  qn  of  Qn  arc  eigenvectors  of  A  associated 
with  eigenvalues  Ai,  A2,  •  ■  ■  ,  An,  respectively,  which  arc  assumed  to  be  in  an  ascending  order, 


Ai  <  <  •  •  •  <  An. 


We  note  the  non-uniqueness  of  eigenvalue  decomposition  (3).  One  could  not  only  alter  the  signs  of  eigen¬ 
vectors,  but  also  choose  different  unit  eigenvectors  associated  with  eigenvalues  of  multiplicity  greater  than 
one.  For  convenience,  we  will  treat  (3)  as  a  generic  form  of  decomposition  that  represents  all  possible 
alternatives. 

Given  a  positive  integer  k  <  n,  it  is  well  known  that  the  eigenvector  matrix  O/.  is  a  solution  to  the 
trace  minimization  problem  (1).  As  is  stated  in  the  introduction,  instead  of  solving  (1)  directly,  we  propose 
to  solve  the  trace-penalty  minimization  problem  (2).  We  first  analyze  the  relationship  between  the  two 
problems  (1)  and  (2),  and  then  derive  some  useful  properties  for  (2). 

2.1  Equivalence  and  other  Properties 

We  start  with  the  following  definition  of  equivalence. 
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Definition  2.1.  Problem  (2)  is  said  to  be  equivalent  to  (1)  if  each  of  its  global  minimizers  spans  a  k- 
dimensional  eigenspace  associated  with  k  smallest  eigenvalues  of  A. 

The  first-order  necessary  conditions  for  trace  minimization  (1)  can  be  written  as 

AX  -  XiX^AX)  =  0,  XTX  =  I. 

On  the  other  hand,  the  first-order  necessary  condition  for  trace-penalty  minimization  (2)  is  simply 

V/M(X)  =  AX  +  pX(XTX-I)  =  0.  (5) 

Let  L(Mnxfc,  Wixk)  be  the  space  of  linear  operators  that  map  Wixk  to  Mnxfc.  The  Frechet  derivative  of 
V/^  at  X  is  defined  as  the  (unique)  function  V2/M  :  Mnxfc  — >  L(M.nxk,  Mnxfc)  such  that 

r  \\VU(X  +  S)  Vf,(X)  -  V2f,(X)(S) ILf 

(S-I/.-X)  \\S\\F 

It  can  be  easily  verified  that 

V2/m(X)(5)  =  AS  +  pS(XTX  -  I)  +  pX(STX  +  XTS),  (6) 

from  which  one  can  also  derive  the  matrix  representation  of  V2ffl(X)  in  terms  of  Kronecker  products. 

The  first-order  necessary  condition  (5)  implies  that  each  stationary  point  X  of  (2)  spans  an  invariant 
subspace  of  A,  since  Xflt(X)  =  0  is  obviously  equivalent  to 

AX  =  X(I  -  XTX)p. 

Trivially,  1  =  06  Mnxfc  is  always  a  stationary  point  of  (2).  We  first  study  this  trivial  stationary  point  and 
show  that  it  can  be  eliminated  as  a  minimizer  if  the  penalty  parameter  //  is  sufficiently  large. 

Lemma  2.2.  Let  p  >  0.  If  p  <  Ai,  the  zero  matrix  X  =  0  <G  Mn  x  k  is  the  only  stationary  point  of  problem 
(2);  otherwise,  it  is  not  a  minimizer.  Moreover,  X  =  0  is  a  maximizer  when  p  >  \n. 

Proof  Rearranging  (5),  we  have 

(pi  -  A)X  =  pX(XTX).  (7) 

Multiplying  XT  on  both  side  of  (7)  yields 

XT(pI  -  A)X  =  p( XTX)2.  (8) 

I f  p  <  A  | ,  the  matrix  on  the  left  is  negative  semidefinite  while  the  one  on  the  right  is  positive  semidefinite, 
forcing  the  only  solution  X  =  0.  When  p  >  Ai,  it  suffices  to  note  that  the  Hessian  of  /),  at  X  =  0  is 
V2/m( 0)  =  I  0  (A  —  pi)  which  is  not  positive  semidefinite.  Finally,  we  note  that  V2//;(0)  is  negative 
definite  when  p  >  An.  □ 
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The  next  lemma  shows  that  any  stationary  point  of  (2)  can  be  expressed  in  terms  of  eigenpairs  of  A. 

Lemma  2.3.  Let  p  >  0  and  ( U ,  D )  G  Mnxfc  x  Mfcxfc  denote  k  eigenpairs  of  A  so  that  AU  =  UD,  UTU  =  I 
and  D  is  diagonal.  A  matrix  X  G  M.nxk  is  a  stationary  point  of  (2)  if  and  only  if 

X  =  U[P(I  -  D/p)]l/2VT,  (9) 


where  V  G  M.kxk  is  an  arbitrary  orthogonal  matrix,  and  P  G  Rfcxfc  is  a  diagonal,  projection  matrix  with 
diagonal  entries 


P  — 

1  1.1.  - 


0, 

[  0  or  1, 

In  Particular,  X  is  a  rank-k  stationary  point  only  if  P 


if  d  <  Du, 
other-wise. 

I  and  pi  —  D  y  0  ( being  positive  definite ). 


(10) 


Proof.  We  will  provide  a  proof  for  the  case  where  X  is  of  full-rank.  The  rank-deficient  cases  can  be  proved 
along  the  similar  line,  though  more  notationally  involved  and  tedious. 

Suppose  that  X  is  a  full  rank  stationary  point,  which  spans  an  invariant  subspace  of  A.  Since  every 
/-dimensional  invariant  subspace  of  A  can  be  spanned  by  a  set  of  k  eigenvectors,  we  can  write  X  =  UW 
where  U  consists  of  k  unit  eigenvectors  of  A  and  W  e  M.kxk  is  nonsingular.  Upon  substituting  X  =  UW 
into  (7),  we  derive 


U{pl  -  D)W  =  pUW{WTW)  ^  I-D/p  =  WWT  ^  W  =  (/  -  D/p)1/2VT 


for  some  orthogonal  V  G  M.kxk  (which  can  possibly  hold  only  if  p  >  Du  for  i  =  1, 2  •  •  •  ,  k ).  □ 

Now  we  establish  the  equivalence  between  the  trace-penalty  minimization  model  (2)  and  the  trace  min¬ 
imization  model  (1)  for  proper  p  values. 

Theorem  2.4.  Problem  (2)  is  equivalent  to  (1)  if  and  only  if 


p>  max(0,  Afc).  (11) 

Specifically,  any  global  minimizer  X  of  (2)  has  a  singular-value  decomposition  of  the  form: 

X  =  Qk(I-Ak/p)1/2VT  (12) 

where  Qk  and  Ak  are  defined  as  in  (4),  and  V  G  M.kxk  is  any  orthogonal  matrix. 

Proof.  It  can  be  easily  seen  from  (8)  that  condition  (1 1)  is  necessary  for  the  existence  of  a  rank-A:  stationary 
point.  On  the  other  hand,  suppose  that  p  satisfies  (11).  Using  Lemma  2.3,  it  is  suffice  to  consider  the 
representation  X  =  UW,  where  U  consists  of  any  k  eigenvectors  of  A  and  IU  G  Mkxk.  Hence,  we  obtain 

2fifX)  =  tr (DWWt)  +  ^\\WTW  -  /|||, 
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where  D  =  Diag(d)  <G  M.kxk  is  a  diagonal  matrix  with  k  eigenvalues  of  A  on  the  diagonal  corresponding  to 
eigenvectors  in  U.  A  short  calculation  shows  that 


2  MX)  =  lfiwWT  +  (Dhi-l)\\2F  +  «(D)-/-tt(D2) 

>  |||(D/,.-/)  +  |||.  +  ti-(D)-d;u-(D2) 


where  t+  =  max(0,  t)  and 


i=  1 
k 


2  \p 


d2 


2p 


i=  1 


P  (d 


m  =  7i  A-1  +  d--  = 


2  V/i 


d2 


2p 


d  —  d2/(2p),  d  <  p, 
p/2,  d  >  p. 


Note  that  d(d)  is  monotonically  nondecreasing  since  d'(d)  =  1  —  d/p  >  0  in  (— oo,  p). 
Substituting  the  formulation  of  A  defined  in  (12)  into  //t(A),  we  obtain 


2 MX)  =  tr(Afc)  -  ^-tr(A2)  =  £  0(A *)  <  2/M(A), 

^  i=  1 


which  verifies  that  A  is  a  global  minimizer.  This  completes  the  proof.  □ 

The  next  theorem  indicates  that  our  trace-penalty  minimization  model  (2)  can  have  far  fewer  undesirable, 
full-rank  stationary  points  than  the  trace  minimization  model  (1).  Hence,  when  the  penalty  parameter  is  suit¬ 
ably  chosen,  one  could  reasonably  argue  that  from  an  optimization  point  of  view  trace-penalty  minimization 
is  theoretically  more  desirable  than  trace  minimization. 

Theorem  2.5.  If  p  6  (max(0,  A*,),  An),  then  //(A)  has  no  local  maxima,  nor  local  minima  other  than 
the  global  minimum  attained  by  X  defined  in  (12).  Moreover,  if  p  6  (max(0,  \f),  Xk+p)  where  Xk+P  is 
the  smallest  eigenvalue  greater  than  A k,  then  all  k-dimensional  stationary  points  of  //(A)  must  be  global 
minimizers. 

Proof  To  prove  the  first  statement,  we  show  that  for  p  £  (max(0,  A/c),  Xn)  any  stationary  point  other  than 
the  global  minimizers  can  only  be  saddle  points. 

Without  loss  of  generality,  consider  stationary  points  in  the  form  of  (9)  with  V  =  I,  that  is 


A  =  U[P(I  -  D/p)]1/2  =  U[(I  -  D / p)P]1/2,  (13) 


where  AU  =  UD,  UTU  =  /,  and  D  is  diagonal.  The  proof  still  holds  for  an  arbitrary  orthogonal  matrix 
V  since  the  function  value  //(A)  is  invariant  with  respect  to  V.  Substituting  (13)  into  the  Hessian  formula 
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(6),  we  obtain 


V2/M(X)(S)  =  AS  -  S(/i(J  -  P)  +  DP)  +  n X(STX  +  XTS).  (14) 

We  next  show  that  there  exists  two  different  matrices  S  G  Mnx such  that  tr (STX2  ffl(X)(S))  <  0  and 
tr(,S'TV2///(X)(;S'))  >  0,  respectively,  unless  the  stationary  point  X  is  constructed  from  eigenvectors  asso¬ 
ciated  with  a  set  of  k  smallest  eigenvalues  which  corresponds  to  the  global  minimum. 

First  assume  that  X  has  full  rank.  Then  /x/  >~  D  and  P  =  I  in  (13).  Letting  P  =  I  in  (14)  yields 

X2ffl(X){S)  =  AS  —  SD  +  /x X(STX  +  XTS). 

For  S  =  U,  we  have  STX  =  XTS  =  (I  -  D/n)1!2  and 

tr (STX2f^X)(S))  =  0  +  2tr(/i/  -  D)  >  0. 

On  the  other  hand,  if  X  is  not  a  global  minimizer,  without  loss  of  generality  we  can  assume  that  U  contains 
q:j  but  not  q,  where  A*  <  A j.  Let  S  contain  all  zero  columns  except  a  single  nonzero  column  that  is  q,  at  the 
position  so  that  the  only  nonzero  column  of  SD  is  q,Xj.  For  such  an  S,  we  have  ST X  =  0  and 

tr (5TV2//i(X)(5))  =  qj(Aqi  -  qiXj)  +  ^tiiSTX(STX  +  XTS))  =  (A*  -  Ay)  +  0  <  0. 

Hence,  all  full-rank  stationary  points  arc  saddle  points  except  the  global  minimizers. 

We  now  consider  the  rank-deficient  case,  namely,  there  exists  at  least  one  zero  entry  in  the  diagonal  of 
P,  say  Pa  =  0  for  some  i  G  [1  ,k\.  Let  U  be  the  remaining  matrix  after  deleting  the  z-th  column  from  U . 
Since  rank((7)  =  k  —  1,  there  must  exist  at  least  one  column,  denoted  by  qj,  of  Qj.  that  is  not  contained  in 
U.  Then  it  holds  qJU  =  0  and  qj Aqj  <  A/,..  Let  S  contain  all  zero  columns  except  one  nonzero  column 
that  is  qj  at  the  z-th  position  so  that  both  SP  =  0  and  ,S'  r4(  =  0.  Consequently,  in  view  of  (14)  we  have 

tr(5TV2/M(X)(,S))  =  qjAqj  -  n  +  Mtr (STX(STX  +  XTS))  <  (Afc  -  q)  +  0  <  0. 

On  the  other  side,  let  S  contain  all  zero  columns  except  that  the  i-th  column  is  qn.  For  any  integer  l  G  [1,  A:], 
if  the  column  Ui  =  qn,  then  it  follows  from  Lemma  2.3  that  Pu  =  0  and  q^Xi  =  0.  Otherwise,  the 
column  Ui  /  qn,  thus  qJjJi  =  0  which  implies  qJ:X  =  0.  By  our  assumption,  //  <  qJ,  Aqn  =  Xn.  Hence, 
U'(SJX2  ffJXjiS))  =  Xn  —  //  >  0.  This  complete  the  proof  of  the  first  statement. 

The  second  part  of  this  theorem  is  a  direct  consequence  of  the  full-rank  requirement  and  the  stationary- 
point  expression  (9)  which,  together,  demands  / il  y  D.  Hence,  for  //  G  (max(0,  A&),  A/,.+/)),  D  can  only 
have  a  set  of  k  smallest  eigenvalues  of  A  on  its  diagonal.  □ 

2.2  Error  Bounds  between  Optimality  Conditions 

After  establishing  the  equivalence  between  our  trace-penalty  minimization  model  (2)  and  the  original  trace 
minimization  model  (1),  we  investigate  the  relationship  between  the  first-order  optimality  conditions  of  the 


two  models,  which  would  play  an  important  role  in  setting  the  stopping  tolerance  for  an  iterative  algorithm 
to  solve  (2). 

Given  any  approximate  solution  X  of  (2),  an  orthonormal  basis  for  the  range  space  of  X,  say  Y(X),  is  a 
feasible  solution  of  (1).  Specifically,  let  X  be  of  full  rank  and  X  =  UY  VJ  denote  the  partial  (or  economy- 
form)  singular  value  decomposition  (SVD)  of  X,  where  U  G  Rnx/,:  and  V  G  Mfcx*:  have  orthonormal 
columns,  and  S  G  Mfcxfc  is  a  diagonal  matrix  with  the  singular  values  of  X  on  its  diagonal.  Then  a  particular 
choice  for  Y{X )  is 

Y(X)  =  U.  (15) 

Consequently,  the  violation  of  the  first-order  necessary  conditions  of  the  trace  minimization  (1)  can  be 
measured  by  the  Frobenious  norm  of  the  residual 

R(X)  =  AY(X )  -  Y(X)  ( Y(X)tAY(X ))  .  (16) 

Lemma  2.6.  Let  p  >  max(0,  A^),  andXJ  f^(X)  and  R(X)  be  defined  as  in  (5)  and  (16),  respectively.  Then 

||^)||F<a-in(Af)||V/M(X)||F,  (17) 

where  crmin(Af )  is  the  smallest  singular  value  of  X.  Moreover,  for  any  global  minimizer  X  and  any  e  >  0, 
there  exists  5  >  0  such  that  whenever  ||X  —  X ||p  <  5, 

II^POIIf  <  /11  +  '  ||V/^(X)||F.  (18) 

V1  —  xk/d 

Proof.  Recall  that  X  =  UYV1  where  the  columns  of  U  form  an  orthonormal  basis  for  the  range  space  of 
X .  Projecting  V  f^(X)  onto  the  null  space  of  XT  and  using  the  definition  of  R(X)  in  (16),  we  obtain 

(/  -UUT)\7f^X)  =  (1  -UUT)(AX  +  pX(XTX-I)) 

=  (I-  UUt)AX  =  (/  -  UUt)AUYVt 

=  R(X)YVJ. 


A  rearrangement  of  the  above  equality  gives 

R(X)  =  (I-UUt)VU(X)  vy~\ 
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which  leads  to  (17)  through  the  following  steps, 


\\R(x)\\f  =  ||(/-  ccT)v/Mpr)C£_1||F 

<  ||(/  —  UUT)V f h{X)V\\f  ||£-1||2 
=  \\(I-UUT)Vf,(X)\\Fa^n(X) 

<  <t^L(X)\\VU(X)\\f, 


where  the  last  inequality  is  due  to  the  fact  that  I  —  UUT  is  a  projection. 

To  see  the  second  paid  of  this  lemma,  we  only  need  to  recall  Theorem  2.4  that  gives 


^min 


(X)  =  y/1  -  A k/fi, 


for  any  global  minimizer  X.  This  completes  the  proof. 


□ 


2.3  Condition  Number  of  the  Hessian  at  Solution 


An  important  quantity  for  a  smooth  unconstrained  optimization  model  is  the  condition  number  of  the  Hes¬ 
sian  at  solution,  which  can  be  defined  in  our  case  as 


<V2f,(X)) 


Ama^V2/^)) 

Amin(V2/^))  ’ 


where  Amax(-)  (or  Amin(-))  stands  for  the  largest  (or  the  smallest)  eigenvalue  of  the  referred  matrix,  and  X 
is  a  global  minimizer  of  (2).  Obviously,  k  is  infinity  when  the  involved  matrix  is  singular. 

In  the  next  lemma,  we  calculate  the  condition  number  k(V2  f^(X))  for  the  case  k  =  1,  and  give  a  lower 
bound  for  the  general  case. 

Lemma  2.7.  Let  p  >  max(0,  A &)  and  X  be  a  global  minimizer  of  in  (2).  The  condition  number  of  the 
Hessian  of  ffl  at  X  satisfies 


max  (2(fi  -  Ai),  Ara  -  Aj)  \n  -  Ai 
min  (2(/i  A/j),  Afc_)_i  A/j)  A^-i-i  A^ 


(19) 


In  particular,  for  k  =  1  the  first  inequality  above  holds  as  an  equality  . 


Proof  As  is  well  known,  the  largest  or  smallest  eigenvalues  of  a  symmetric  matrix  can  be  obtained  by 
maximizing  or  minimizing  the  Rayleigh-Ritz  quotient,  namely,  in  our  case, 


Amax(V2/M(X))  =  trmax=itr  (STV2/M(X)(S)), 
Amin(V2/M(X))  =  mm=itr  {STV2  U(X)(S)). 


We  first  treat  the  case  of  k  =  1  in  which  X  and  S  refer  to  vectors,  so  we  simplify  them  as  x  and  s.  Also, 
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the  trace  constraint  tr(,S'T,S')  =  1  becomes  sT-s  =  1.  Applying  theorem  2.4,  x  takes  the  form 


x  =  ±qi  \/l  -  Ai/yu. 

Hence,  fi(xTx  —  1)  =  —  Ai.  Using  (6)  with  s  =  Qy  and  yTy  =  1,  we  have 

sTV2/m(x)s  =  stAs  -  Ai  +  2  y(sTx)2  =  yT  [A  -  Ai  I  +  2  (/j  -  Ai)eie[]  y. 


Therefore, 


max  stV2/m(x)s  =  max  (2(/r  —  Ai),  \n  —  Ai) , 

STS=1 

min  sTV2  fn(x)s  =  min  —  Ai),  A2  —  Ai) . 

STS=1 

Hence,  (19)  holds  as  an  equality. 

Now  we  treat  the  general  case.  Let  us  recall  theorem  2.4  that  X  takes  the  form  of  (12).  Since 
tr  (STV2f^X)(S))  =  tr  ((SV)T\72f^(XV)(SV)), 

and  tr((5U)T5U)  =  tr(5T5),  the  orthogonal  matrix  U  in  a  solution  X,  as  is  defined  in  (12),  does  not 
change  the  maximum  or  minimum  of  tr(.S'TV2//,  (A'')(;S'))  under  the  trace  constraint  tr(5TS)  =  1.  Without 
loss  of  generality,  in  this  proof  we  set  V  =  I  and  only  consider 

X  =  Qk(I-  AfcM)1/2. 


In  view  of  (6),  we  have 

tr (STV2f^{X)(S))  =  tr (StAS)  -  tr(NT5Afc)  +  fitr((STX)2  +  STXXTS). 

Let  all  the  columns  of  S  be  zero  except  that  the  j- th  column  is  the  unit  eigenvector  qt  of  A,  namely. 


S  =  qte J,  j  <  k,  ej  G 


which  satisfies  the  trace  constraint  tr(S’TS’)  =  1.  Clearly, 
5tX  =  ejq]Qk(I  -  Afe/zr)1/2  =  | 
It  is  not  difficult  to  verify  that 


ejej^/l  -  A i/n,  i<k, 

0,  otherwise. 


tr :(SlV2U(X)(S))  =  A <  -  Xj  +  +  7i)(//  -  Ai). 


(20) 
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where  5ij  is  the  Kronecker  delta  and  7*  =  1  if  i  <  k  and  0  otherwise.  For  different  choices  of  i  and  3  <  k, 
the  right-hand  side  of  (20)  can  take  the  values  of  A,  —  A j  for  i  >  k  >  j,  and  2 ( //  —  A, )  for  i  =  j  <  k.  These 
values  imply  the  bounds 

Amax  (v2/M(X))  >  max  (2(/z  -  Ai),  An  -  Ai) 

Amin  (v2/m(1))  <  min  (2(/x  -  Afc),  Afc+i  -  Xk) , 

from  which  we  obtain  (19)  and  complete  the  proof.  □ 

We  note  that  in  the  general  case  of  k  >  1 .  the  bound  in  (19)  can  be  further  tightened.  For  example,  by 
letting  i  <  j  =  k,  we  obtain  the  values  of //  —  A/,.;  hence  the  factor  2  can  be  eliminated  from  the  denominator. 
One  can  also  show  that  the  Hessian  becomes  singular  when  there  arc  eigenvalues  A  j,  j  <  k,  of  multiplicity 
greater  than  one.  In  general,  this  kind  of  non-uniqueness  does  not  necessarily  imply  a  higher  degree  of 
difficulty  in  solving  the  optimization  problem. 

Even  though  the  bound  in  (19)  is  not  tight,  it  is  good  enough  to  serve  the  purpose  of  demonstrating 
two  useful  points:  (a)  the  penalty  parameter  /j  should  not  be  chosen  excessively  close  to  Xk,  and  (b)  one 
should  expect  some  difficulty  arising  from  the  existence  of  a  relatively  tiny  gap  between  A/,,  and  Afc+i.  This 
second  difficulty,  caused  by  clustered  eigenvalues  at  a  critical  location,  represents  a  common  challenge  to 
eigenvalue  solvers. 

2.4  Extensions 

It  is  not  difficult  to  see  that  our  analysis  in  this  section,  as  well  as  the  algorithmic  framework  described  in 
the  next  section,  can  be  extended  to  the  generalized  eigenvalue  problem: 

min  tr (XTAX),  s.t.  XJBX  =  I,  (21) 

xeR"xfc 

where  B  is  symmetric  and  positive  definite.  In  this  case,  the  trace-penalty  minimization  model  is  simply 

min  f^X)  :=  Jtr  (XJAX)  +  1|  XTBX  -  lfF.  (22) 

XeRnXk  2  4 

In  fact,  by  change  of  variable  Z  =  B^X  (where  the  symmetric  matrix  /i  2  satisfies  B  ^  B  j  =  B),  the 
generalized  eigenvalue  problem  (21)  can  be  converted  to  a  standard  eigenvalue  problem 

min  tr (ZTAZ),  s.t.  ZT Z  =  I,  (23) 

zeRnxk 

where  A  =  B~2AB~2.  As  a  result,  our  model  analysis,  directly  applicable  to  (23),  can  be  translated  to 
(22)  in  a  straightforward  manner.  For  example,  Theorem  2.4  gives  the  global  minimizers  of  (23)  as 

Z  =  Qk(I-  Ak/fi)VT, 
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where  A/,.  is  diagonal  with  k  smallest  eigenvalues  of  A  on  its  diagonal  that  also  happen  to  be  the  generalized 
eigenvalues  of  the  matrix  pair  (A,  B),  and  Qk  consists  of  corresponding  eigenvector  of  A.  By  the  change  of 
variables  Z  =  B  2  X,  then 

X  =  Qk(I-  Ak/v)VJ,  (24) 

where  Q k  =  B~?Qk  consists  of  the  generalized  eigenvectors  associated  with  the  k  smallest  generalized 
eigenvalues  in  Ak.  Naturally,  the  equivalence  of  the  trace-penalty  minimization  model  (22)  to  the  trace 
minimization  model  (21)  requires  that  //  >  max(0,  A/,.)  where  Xk  is  a  /;:-th  smallest  generalized  eigenvalue 
of  the  matrix  pair  (A,  B). 

Another  useful  extension  is  to  find  eigenvectors  in  the  orthogonal  complement  of  the  column-space  of  a 
given  U  such  that  UTU  =  /,  that  is: 

min  tr (XTAX),  s.t.  XJBX  =  /,  UTX  =  0.  (25) 

The  variation  (25)  can  arise  from  a  deflation  procedure  where  U  is  constructed  from  already  converged 
eigenvectors.  The  trace-penalty  minimization  model  corresponding  to  (25)  is 

min  UX)  :=  -tr(XTAX)  +  ^\\XJBX  -  I  III,  s.t.  UTX  =  0.  (26) 

xgrx‘  2  4 

Starting  from  X°  such  that  UTX°  =  0,  a  projected-gradient  method  for  solving  (26)  has  the  form  XJ+l  = 
X'J  —  aJ  ( /  —  UUJ)X  fi,  (Xr)  which  is  just  a  slight  modification  of  regular  gradient  methods  to  be  discussed 
in  details  in  the  next  section. 

The  principle  of  trace-penalty  minimization  can  in  fact  be  applied  to  other  types  of  eigenvalue  problems, 
but  for  the  sake  of  space  we  will  leave  further  extensions  to  future  work. 

3  Algorithmic  Framework 

3.1  Gradient  Methods  for  Trace-Penalty  Minimization 

The  trace-penalty  minimization  model  proposed  in  the  previous  section  is  an  unconstrained  nonconvex  min¬ 
imization  problem.  There  are  many  well-studied  approaches  for  this  problem,  such  as  the  steepest  descent 
gradient,  the  conjugate  gradient,  the  Newton’s  and  Quasi-Newton  methods.  Considering  the  scale  of  the 
eigenvalue  computation  of  interest,  in  this  paper  we  focus  on  the  gradient-type  methods  of  the  form: 

X>+1  =  X>  -  ajVf»{Xj),  (27) 

where  the  superscript  j  denotes  the  j-th  iteration  and  aJ  is  the  step  size. 

Although  the  penalty  function  may  have  multiple  stationary  points.  Theorem  2.5  shows  that  when  /t  is 
chosen  slightly  above  Xk  >  0,  then  rank-k  stationary  points  arc  likely  to  be  all  global  minimizers.  The 
following  lemma  suggests  that  iterates  generated  by  (27)  will  most  likely  remain  full  rank. 
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Proposition  3.1.  Let  Xj3  1  be  generated  by  (27)  from  a  full-rank  iterate  XL  Then  XJ+ 1  is  rank-deficient 
only  if  1  j  a?  is  one  of  the  k  generalized  eigenvalues  of  the  problem: 

[(X yfVf^X^u  =  A  [(X*)T(X*)]u.  (28) 

On  the  other  hand,  if  a3  <  amin(X3)/\\X  f^X3)^,  then  X3+l  is  of  full  rank. 

Proof.  Suppose  that  X3+l  is  rank  deficient.  Then  there  exists  an  nonzero  vector  u  such  that  X3+ 1  u  =  0.  In 
view  of  (27),  we  have 


Xju  -  ajVf^(Xj)u  =  0. 


(29) 


Hence,  (28)  holds  under  A  =  1  fa?  after  multiplying  both  sides  of  (29)  by  (X3)T /a3 .  Due  to  the  full  rank 
of  XL  (X3  )](X'J)  is  positive  definite.  The  expression  of  the  gradient  in  (5)  implies  that  (X3)JX  ffl(X3) 
is  symmetric.  Therefore,  (28)  is  a  generalized  symmetric  eigenvalue  problem.  The  second  part  of  the 
proposition  follows  directly  from  (29).  □ 

We  next  present  a  few  strategies  for  choosing  the  step  size  aL  Given  an  arbitrary  direction  D  E  Mnxfc, 
the  objective  function  ftl{X  +  aD )  is  a  quartic  function  of  a;  precisely, 

U(X  +  aD)  =  hr (XJAX)  +  |tr (BTB)  +  (tr{DTAX)  +  hr{BTW)^  a 
+  (tr(DJAD)  +  hr(BJH)  +  ^tr (WTW)^j  a2 
+  (|tr (WJH)^)  a 3  +  (^tr(f7xf7))  a4,  (30) 

where  B  =  XTX  —  I,W  =  I)lX  +  XTD  and  H  =  I)[I).  The  steepest  descent  gradient  method  computes 
the  step  size  by  using  an-  one-dimensional  exact  minimization,  i.e.,  a 3  =  argmin  f), (X3  —  aXf(X3)), 
which  is  determined  by  a  root  of  the  cubic  equation  —  aX7  f{X3)/ da  =  0.  Note  that  //  >  0  and 

tr(  //  '  //)  >  0  for  Vf/f  X)  f  0,  a  positive  root  always  exits.  Although  executing  exact  line  searches  along 
each  steepest  descent  direction  often  converges  slowly,  it  has  been  demonstrated  in  [16,  17]  that  mixing  it 
with  some  other  step  sizes  in  an  alternative  fashion  can  accelerate  convergence  significantly. 

Another  successful  approach  is  to  use  line  search  with  a  Barzilai-Borwein  (BB)  size  [3].  Let 

S3  :=  X3  -  X3-1  and  Yj  =  Xf^X3)  -  Xf^X3~l).  (31) 


The  BB  step  size  is 

_  tr m  or  ^  _  llffllli  (32) 

BB1“  Ill'll?  BB2_tr  ((Si)TYi). 

Since  S 3  =  aJ_1V//i(X-J_1),  the  computation  of  the  BB  step  sizes  only  requires  to  store  one  intermediate 
matrix  Y3  in  (31).  When  n  and  k  arc  huge  or  when  storage  becomes  a  critical  factor,  one  can  still  compute 
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a  so-called  part  ial  BB  step  size  by  using  (32)  but  with  a  pre-selected  small  subset  of  columns  of  both  Sj  and 
YJ .  making  the  storage  of  an  extra  Y-matrix  unnecessary.  A  simple  heuristic  line  search  scheme  that  we 
will  use  is  to  shorten  the  step  size,  whenever  necessary,  by  back-tracking  a?  =  a8h ,  where  a  is  one  of  the 
BB  step  sizes  in  (32),  <5  E  (0, 1)  and  h  is  the  smallest  positive  integer  satisfying  the  condition 

-a8hVfl)<  2/„(X*).  (33) 

It  is  known  that  certain  global  convergence  properties  can  be  guaranteed  in  theory  by  more  elaborate  line 
search  conditions  such  as  non-monotone  line  search  conditions  in  [7,  9,  18],  We  have  found,  however,  that 
on  our  trace-penalty  function  //t(X)  condition  (33)  has  performed  efficiently  and  reliably. 

At  the  end  of  trace-penalty  minimization,  a  Rayleigh-Ritz  (RR)  step  is  necessary  to  compute  Ritz-pairs 
as  approximations  to  eigenpairs.  Specifically,  in  our  context  the  RR  step  corresponding  to  a  given  matrix 
X  E  Wixk  is  defined  by  the  following  steps. 

1.  Orthogonalize  and  normalize  X  to  obtain  U  so  that  UTU  =  I. 

2.  Compute  the  projection  UTAU  and  its  eigenvalue  decomposition  YTEY. 

3.  Assemble  the  Ritz-pairs  into  the  matrix-pair  (Y,  E)  where  Y  =  UV . 

For  convenience,  we  will  refer  the  above  RR  procedure  as  a  map  (Y,  E)  =  RR(X). 

In  Algorithm  1  below,  we  specify  a  basic  version  of  a  method  for  trace-penalty  minimization,  called 
“EigPen-B”,  which  uses  the  first  BB  step  formula  in  (32),  the  simple  line  search  condition  (33),  and  a 
termination  rule 

||V/m(Xj)||f  <  e,  (34) 

where  e  >  0  is  a  prescribed  tolerance. 

Algorithm  1:  Eigenspace  by  Penalty  -  basic  version  (EigPen-B) 

Initialize  X°  E  MAxk  and  estimate  n  E  (Afc,  An).  Set  e,  6  E  (0, 1)  and  j  =  0. 

Compute  initial  step  a  =  ||X°||if/||V/m(X0)||,f  . 
while  ||V/M(X^')||F>e  do 

compute  the  smallest  natural  number  h  so  that  a5h  satisfying  (33); 
update  Xi+1  =  X?  -  <YV /^(X-?); 
compute  a  using  the  first  formula  in  (32); 
increment  j  and  continue. 

Execute  the  RR  procedure  (X,  E)  =  R R(XJ). 


The  memory  requirement  of  Algorithm  1  is  summarized  as  follows.  Four  n  by  k  matrices,  X,  AX, 
V./j/fX )  and  Y  defined  in  (31),  are  required.  As  mentioned  earlier,  the  need  for  storing  Y  can  be  essen¬ 
tially  eliminated  if  partial  BB  step  sizes  are  computed  (without  obvious  performance  degradation  in  our 
experiments).  Using  the  convention  that  an  m  x  p  matrix  times  a  p  x  n  matrix  costs  2 mnp  flops,  we  sum¬ 
marize  the  computational  complexity  of  various  tasks  as  follows.  Let  s  be  the  sparsity  of  the  matrix  A, 
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i.e.,  s  =  | {  A ij  |  A j.j  0} | /n2.  The  matrix-matrix  multiplication  AX  needs  2 sn2k  flops  while  sn2  is  often 
around  0(n).  Forming  X (XT X  —  I)  takes  4 nk2.  Each  of  the  inner  product  tr(',S'1V)  and  the  calculation  of 
the  norms  ||V/m(X)||f  and  ||F||f  takes  2 nk.  Hence,  the  total  cost  of  each  step  in  trace-penalty  minimiza¬ 
tion  is  at  most  4 nk2  +  8 nk  +  2 sn2k  +  0{k3).  The  orthogonalization  step  by  using  Cholesky  factorizations 
needs  2 nk2  +  Oik3)  since  XTX  comes  free  from  trace-penalty  minimization  steps.  The  projection  UJ All 
requires  2 nk  +  2 sn2k.  The  eigenvalue  decomposition  of  the  projection  in  an  RR  step  takes  Q(k3).  Fi¬ 
nally,  assembling  the  Ritz-pairs  takes  another  2 nk2.  Therefore,  the  total  complexity  of  the  last  RR  step  is 
6  nk2  +  2  sn2k  +  0{k3). 

3.2  Enhancement  by  Restarting 

Algorithm  EigPen-B  often  works  quite  well  in  practice.  However,  a  typical  behavior  of  gradient  methods 
is  that  they  can  reduce  the  objective  function  rather  rapidly  at  an  initial  stage,  but  the  amount  of  reduction 
can  become  extremely  small  as  iterates  get  closer  to  a  solution.  In  trace-penalty  minimization,  it  has  been 
observed  that  restarting  the  gradient  method  with  a  modified  X  can  usually  help  accelerate  convergence  and 
achieve  a  higher  accuracy  more  quickly.  In  this  subsection,  we  describe  a  restarting  strategy  for  trace-penalty 
minimization  that  utilizes  more  than  one  RR  step.  In  addition  to  accelerating  convergence,  the  restarting 
strategy  provides  a  more  reliable  termination  procedure  by  examining  more  than  one  set  of  Ritz-pairs. 

We  now  demonstrate  how  RR  steps  can  help  speed  up  trace-penalty  minimization.  Let 

Y  =  argmin{//i(X)  :  X  £  5},  (35) 

xeRrixfc 

where  X  £  S  means  that  every  column  of  X  is  in  the  subspace  S.  Let  XJ  be  the  iterate  generated  by  the 
EigPen-B  algorithm  after  J  iterations.  Clearly,  as  long  as  X  J  £  S  there  holds 

U(X)  <  f,t(XJ).  (36) 

On  the  other  hand,  consider  the  subspace  trace  minimization  problem 

U  =  argmin  (tr(XTAA:)  :  ATtX  =  /,  X  £  S]  ,  (37) 

.YeRnXd 

where  d  is  the  dimension  of  the  subspace  S.  Clearly,  the  RR  step  (U,  E)  =  RR(X'7)  is  equivalent  to 
solving  (37)  for  S  =  span  {A'7}  (assuming  that  XJ  £  Mnxfc  has  full  rank)  so  that  UT  A  U  =  E  is  diagonal 
(otherwise,  replace  U  by  UV  where  UJ AU  =  EE  V/T). 

We  now  show  that  a  “better  point”  Y  for  trace-penalty  minimization  in  (35)  and  (36)  can  be  explicitly 
constructed  from  the  RR  step  output  ( U ,  E)  =  RR(XJ).  We  first  consider  the  simple  case  S  =  span{  A^’7}. 

Lemma  3.2.  Let  S  =  span  {A"'7}  where  XJ  £  Mnx7:  has  full  rank,  and  let  U  be  defined  in  (37)  so  that 
UTAU  =  E  is  diagonal.  Then  a  Y  in  (35)  has  the  form  Y  =  U(I  —  E/^r)1/2,  provided  that  pi  y  E. 

Now  we  prove  a  more  general  result  that  contains  the  above  as  a  special  case. 
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Lemma  3.3.  Let  S  D  span{AJ}  have  dimension  d  >  k,  and  U  be  defined  in  (37)  so  that  UTAU  =  X  is 
diagonal  whose  diagonal  elements  are  arranged  in  an  ascending  order.  Then  a  matrix  Y  in  (35)  has  the 
form  Y  =  UfdJ  where  Uj.  consists  of  the  first  k  columns  ofJJ,  and  D  E  Mfcxfc  is  a  diagonal  matrix  whose 
i-th  diagonal  element  is 

(  i  y/2 

Du  =  max  (  0, 1  -  J  ,  i  =  1,2,---  ,  k.  (38) 

Proof.  Since  U  E  Mnxd  is  a  basis  of  S.  the  solution  of  (35)  can  be  expressed  as  X  =  UW  for  some 
W  E  Wdxk.  Substituting  X  =  UW  into  (35)  and  noting  that  UTAU  =  X  and  UTU  =  /,  we  reduce  (35)  to 

min  f^UW)  =  *tr(IUTXIU)  +  ^||  WTW  -  lfF.  (39) 

Using  the  fact  that  X  is  a  diagonal  matrix,  it  can  be  verified  (see  Theorem  2.4)  that  W  =  (^D  oj  ,  with 
the  diagonal  matrix  D  defined  as  in  (38),  is  indeed  a  solution  of  (39).  Therefore,  Y  =  UW  =  U^D.  □ 

In  Algorithm  2  below,  we  present  our  trace-penalty  minimization  algorithm  with  restarting,  which  is 
used  to  perform  numerical  experiments  presented  in  the  next  section.  The  algorithm,  called  EigPen,  contains 
two  loops.  The  inner  loop  is  stopped  once  the  condition 

||V/„(X0||F  <  Ci  max(l,  ||A^'||F)  (40) 

is  met,  where  et  E  (0, 1)  is  a  prescribed  tolerance.  Then  an  RR  step  is  executed  to  construct  Ritz-pairs  and 
termination  criteria  are  checked  for  the  outer  loop.  If  the  algorithm  does  not  stop,  then  a  smaller  tolerance 
Cj+i  =  Se€i  is  set  where  5e  E  (0, 1),  and  a  better  iterate  is  constructed  from  which  the  algorithm  restarts  the 
next  round  of  inner  iterations  by  calling  EigPen-B. 


Algorithm  2:  Eigenspace  by  Penalty  -  enhanced  version  (EigPen) 

Initialize  A"0  E  M.nxk  and  estimate  /a  E  (A&,  An).  Set  eo,  5,  Se  E  (0, 1)  and  i  =  j  =  0. 
while  “not  converged"  do 

Set  Xi  =  Xi  and  compute  a  =  \\Xf\F /\\V  f^(Xi)\\F. 
while  ||V/m(X?’)||f  >  a  ■  max(l,  HAA^jf)  do 

compute  the  smallest  integer  h  so  that  aJ  =  adh  satisfying  (33); 
update  AJ+1  =  AJ  -  cUV//i(AA'); 
compute  a  using  the  first  formula  in  (32); 
increment  j  and  continue. 

Execute  the  RR  procedure  (A,  X)  =  RR(AJ)  and  let  A*+1  =  A(J  —  X//x)2 . 
Update  the  tolerance  e,+i  =  Se€i  and  increment  i. 


The  RR  restart  approach  allows  flexibility  to  integrate  other  techniques  into  EigPen.  For  example,  at  the 
j-th  iteration,  if  one  chooses  the  subspace  S  in  problem  (37)  to  be  S  =  span  {XJ~f  Xf  /1AJ},  then  the 
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RR  step  would  generate  a  step  similar  to  those  in  the  LOBPCG  algorithm  [10].  A  key  difference  between 
LOBPCG  and  EigPen  is  that  RR  steps  constitute  the  main  workhorse  of  the  former,  but  are  utilized  only  a 
few  times  in  the  latter. 

3.3  Penalty  Parameter  Adjustment 

We  now  describe  our  approach  to  choosing  penalty  parameter  //.  Theorem  2.4  states  that  /<  >  nrax(0,  A/,.)  is 
necessary  and  sufficient  for  the  equivalence  between  (1)  and  (2).  A  more  restrictive  range  for  /t  is  given  in 
Theorem  2.5  that  eliminates  all  full-rank  stationary  points  but  the  global  minimizers.  However,  it  requires 
the  extra  work  of  estimating,  at  the  least,  A/.+ 1 .  On  the  other  hand,  both  Lemmas  2.6  and  2.7  suggest  that 
//  should  not  be  too  close  to  A/.,  otherwise  ill-conditioning  could  arise  in  trace-penalty  minimization.  On 
balance,  we  adopt  a  tractable  strategy  of  choosing  //  >  A/,-  (which  is  positive  after  a  shifting  if  necessary)  and 
keeping  it  reasonably  close  to  A/,- ,  without  attempting  to  make  \i  smaller  than  the  next  smallest  eigenvalue. 

Given  an  initial  matrices  X°  6  Rnx*'  whose  columns  arc  normalized,  the  kth  smallest  eigenvalue  Afc  can 
be  estimated  by  the  maximal  value  of  the  diagonal  entries  of  (X0)TAX°,  which  provides  an  initial  choice 

//  =  max(ci,  C2  nrax(diag((X0)TAX0))),  (41) 

where  ci  >  0  and  C2  >  1  arc  two  constants  for  safeguarding.  Another  estimation  of  //  comes  from  the 
structure  of  the  minimizer  X  given  in  (12),  which  yields  the  eigenvalue  decomposition 

/i(/  -  XTX)  =  VkkVT,  (42) 

where  V  and  A/,,  are  defined  in  Theorem  2.4.  Once  a  “good”  iterate  X:l  is  at  hand  after  some  iterations  dur¬ 
ing  trace-penalty  minimization,  the  relationship  (42)  implies  that  A/,:  can  be  estimated  from  the  maximum 
eigenvalues  of  I  —  (AJ)TXA  i.e.,  XJk  =  /./Amax(/  —  (X^X7).  Since  the  computational  cost  of  approx¬ 
imating  the  largest  eigenvalue  of  a  k  x  k  matrix  is  relatively  low,  the  penalty  parameter  //  can  be  updated 
during  trace-penalty  minimization  by  the  formula 

At  =  max ( ci,  c2A£).  (43) 

A  more  accurate  estimate  of  Xk  becomes  available  after  an  RR  step  is  executed  and  the  k- th  Ritz- value  6k  is 
at  hand.  Then  we  use  the  formula 


\i  =  max(d,  c20k).  (44) 

In  favorable  cases  where  the  gap  between  Xk  and  A^+i  is  relatively  large,  our  strategy  of  choosing  n 
slightly  larger  than  Xk  would  have  a  good  chance  to  satisfy  both  fi  >  0  and  fi  e  (Afc,  A/,.+1 ),  provided  that 
Afc  >  0.  In  order  to  ensure  Afc  >  0,  our  current  strategy  is  to  first  scale  the  matrix  A  by  a  «  |Ai|,  assuming 
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that  Ai  <  0,  and  then  add  a  positive  shift  oj  >  1,  obtaining 

A  =  -A  +  ujI  (45) 

a 

that  is  at  least  close  to  being  positive  semidefinite.  After  performing  trace-penalty  minimization  to  A,  the 
above  scale  and  shift  can  be  easily  reversed  to  recover  the  eigenvalues  of  A. 

To  estimate  Ai,  we  note  that  the  well-known  Gershgorin  circle  theorem  implies  that 


Ai  >  ui  :=  min  <  An  -  \Aij\  > 

l  ) 


In  addition,  the  relationship  between  matrix  norms  implies  that 

max(|| A||oo,  ||A||f) 


u2  ■= 


n 


<  \\A\\2  =  max(|Ai|,  |An|). 


Hence,  without  too  much  computation  a  reasonable  value  of  cr  in  (45)  is  taken  as 


a  =  max(min(|wi|,  u2),  1). 


(46) 


As  long  as  a  is  not  much  smaller  than  |Ai|,  setting  a;  to  a  moderate  number  between  1  to  10  usually  works 
well  in  our  tests.  In  our  numerical  experiments,  we  always  take  the  safe  value  of  uj  =  10. 


4  Numerical  Experiments 

In  this  section,  we  test  the  performance  of  EigPen  as  a  general  solver  for  computing  a  set  of  smallest 
eigenvalues  and  their  corresponding  eigenvectors  of  sparse  matrices. 

4.1  Solvers,  Test  Matrices  and  Platform 

We  choose  to  compare  EigPen  with  the  implicitly  restarted  Lanczos  method  in  ARPACK1  and  the  locally 
optimal  preconditioned  conjugate  gradient  (LOBPCG)  algorithm.  All  algorithms  arc  implemented  in  Fortran 
and  parallelized  by  using  OpenMP  except  that  preconditioning  for  EigPen  is  demonstrated  in  MATLAB.  We 
use  an  implementation  of  LOBPCG,  previously  developed  by  the  second  author  of  this  paper,  instead  of  the 
BLOPEX2  package  since  the  performance  of  BLOPEX  seems  not  quite  as  stable  and  efficient  as  our  own 
version  on  our  test  platform. 

We  select  a  set  of  thirteen  test  matrices  arising  from  the  density  functional  theory  (DFT)  for  electronic 
structure  calculation.  They  arc  all  sparse  matrices3  whose  dimension  n,  the  number  of  nonzero  components 

'Downloadable  from  http://www.caam.rice.edu/software/ARPACK 
2Downloadable  from  http://code.google.eom/p/blopex 
’Downloadable  front  http://www.cise.ufl.edu/research/sparse/matrices 
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nnz  and  sparsity  are  listed  in  Table  1.  Many  of  them  arc  produced  by  PARSEC  [1 1],  a  real  space  DFT  based 
code  in  which  the  Hamiltonian  is  discretized  by  using  finite  difference. 


Table  1 :  Problem  characteristics 


Name 

n 

nnz 

too  x 

Andrews 

60000 

410077 

0.01% 

C60 

17576 

212390 

0.07% 

c_65 

48066 

204247 

0.01% 

cfdl 

70656 

948118 

0.02% 

finance 

74752 

335872 

0.01% 

GalOAsl()H3() 

113081 

3114357 

0.02% 

Ga3As3H12 

61349 

3016148 

0.08% 

OPF3754 

15435 

82231 

0.03% 

shallow  .water  1 

81920 

204800 

<0.01% 

SilOH16 

17077 

446500 

0.15% 

Si5H12 

19896 

379247 

0.10% 

SiO 

33401 

675528 

0.06% 

wathenlOO 

30401 

251001 

0.03% 

We  perform  most  of  our  numerical  experiments  on  a  single  node  of  Hopper4,  a  Cray  XE6  supercomputer 
maintained  at  the  National  Energy  Research  Scientific  Computer  Center  (NERSC)  in  Berkeley.  The  node 
consists  of  two  twelve-core  AMD  “MagnyCours”  2.1-GHz  processors  with  a  total  of  32  gigabyte  (GB) 
shared  memory.  However,  memory  access  bandwidth  and  latency  are  nonuniform  across  all  cores.  Each 
core  has  its  own  64  kilobytes  (KB)  LI  and  512  KB  L2  caches.  One  6-MB  L3  cache  shared  among  6 
cores  on  the  Magny-Cours  processor.  There  are  four  DDR3  1333-MHz  memory  channels  per  twelve-core 
“MagnyCours”  processor. 

We  use  the  multi-threaded  version  of  the  Cray  Scientific  Libraries  package,  LibSci,  which  includes 
multi-threaded  versions  BLAS  and  LAPACK  subroutines  optimized  for  Cray  XE6.  Each  sparse  matrix  vec¬ 
tor  multiplication  (SpMV)  in  ARPACK  is  parallelized  through  OpenMP  threads.  While  each  individual 
SpMV  is  not  parallelized  in  EigPen  and  LOBPCG,  a  loop  level  parallelization  is  applied  to  the  loop  that 
produces  the  columns  of  the  product  AX  where  X  contains  multiple  columns.  We  also  implement  a  block 
version  of  the  Davidson  algorithm.  Without  using  a  preconditioner,  the  algorithm  is  essentially  a  steepest 
descent  algorithm  directly  applied  to  (1).  Because  its  performance  in  our  tests  has  been  found  to  be  clearly 
poorer  compared  to  other  algorithms  discussed  in  this  section,  we  do  not  include  its  timing  measurements 
in  the  numerical  results  presented  in  this  section.  The  block  Krylov-Schur  algorithm  [21]  and  the  (block) 
Chebyshev-Davidson  algorithm  [19,  20]  are  not  included  in  our  comparison,  partly  because  suitable  For- 
tran/OpenMP  implementations  of  these  algorithms  were  unavailable  for  our  tests. 


4.2  Termination  Rules  and  EigPen  Parameters 

All  tests  on  the  aforementioned  machine  Hopper  arc  run  as  batch  jobs  with  a  maximum  wall  clock  time  limit 
of  6  hours.  We  terminate  all  algorithms  when  the  relative  residual  norm  (defined  below)  for  every  Ritz-pair 

4Detailed  information  is  available  at  http://www.nersc.gov/users/computational-systems/hopper/ 
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(■ Ui ,  0, )  is  smaller  than  a  prescribed  tolerance  tol,  that  is. 


res  i(U) 


II Am  -  9jUi\\2  , 

max(l,  \6i\)  ~° 


i  =  !,•••,  nev, 


(47) 


where  nev  is  the  number  of  smallest  eigenvalues  to  be  computed,  ut  is  the  7-th  column  of  U  that  satisfies 
UJU  =  I,  and  0r  =  uf  Aui  (recall  that  for  EigPen  we  need  to  perform  an  RR  step  to  obtain  the  Ritz-pair). 
We  also  terminate  an  algorithm  when  the  number  of  iterations  reaches  a  maximum  of  10,000,  but  this  limit 
was  never  reached  in  our  experiments. 

In  EigPen,  the  initial  penalty  parameter  //  is  computed  by  (41)  and  it  is  updated  by  (43)  at  most  three 
times  in  the  first  outer  iteration  of  EigPen.  After  an  RR  step  is  executed,  p,  is  set  according  to  (44)  and 
is  fixed  throughout  the  next  round  of  inner  iterations.  The  constants  c\  =0.1  and  C2  =  1.1  are  used  in 
(41),  (43)  and  (44).  The  initial  tolerance  e°  is  set  to  tol  and  the  backtracking  constant  6  is  set  to  0.25.  The 
parameter  d€  is  adjusted  dynamically  according  to  the  number  of  the  converged  eigenvectors  (denoted  by 
hi)  that  satisfy  the  condition  res,.  <  tol,  for  i  =  1 .... .  nev,  after  each  RR  step: 


0.1, 

if  k\  =  0, 

0.5, 

if  k\  <  0.9  nev , 

0.6, 

if  k\  <  0.95  nev 

0.7, 

otherwise. 

As  is  already  mentioned,  in  order  to  facilitate  the  selection  of  our  penalty  parameter  //  in  EigPen  we  perform 
scaling  and  shifting  as  in  (45)  where  a  is  given  by  (46)  and  oj  =  10  is  always  used. 


4.3  Overall  Performance 

We  first  report  the  overall  performance  of  ARPACK,  LOBPCG  and  EigPen  on  the  test  matrices  listed  in 
Table  1.  The  number  of  smallest  eigenvalues  (nev)  to  be  computed  is  roughly  1%  of  the  dimension  of  A. 
All  algorithms  arc  run  in  parallel  with  24  cores.  No  preconditioner  is  used  in  these  tests. 

For  LOBPCG  and  EigPen,  we  set  the  dimension  of  X  (denoted  by  k )  to  be  slightly  larger  than  nev  to 
improve  the  convergence.  Specifically,  k  is  set  to  nev  x  1.1  (round  to  the  nearest  integer).  For  ARPACK,  we 
set  the  parameter  nev  =  nev  +  100,  which  is  the  dimension  of  the  Krylov  subspace  constructed  to  extract 
desired  eigenvalue  approximations.  The  number  100  is  simply  the  degree  of  the  polynomial  constructed 
implicitly  in  each  restart  to  filter  the  unwanted  spectral  components  from  the  stalling  vector.  Setting  nev 
to  a  larger  value  tends  to  reduce  the  number  of  restarts,  but  making  each  restart  more  costly.  The  optimal 
value  nev  that  achieves  the  best  tradeoff  between  the  number  of  restarts  and  the  amount  of  work  per  restart 
is  generally  difficult  to  determine  a  priori. 

Our  experiments  arc  performed  using  two  different  tolerance  values  tol  =  10”2  and  tol  =  1CP4.  The 
total  wall  clock  times  taken  by  the  three  solver  arc  presented  in  Table  2.  Whenever  the  code  terminates 
abnormally  (either  the  run  is  stopped  prematurely  or  the  maximum  wall  clock  time  limit  of  6  hours  is 
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reached),  the  corresponding  entry  in  the  table  is  marked  by  “ — From  Table  2,  we  observe  that  ARPACK 
did  not  succeed  on  matrices  c_65,  cfdl  and  Gal0Asl0H30  for  both  tolerance  values,  and  LOBPCG  did 
not  succeed  on  matrices  c_65,  Gal0Asl0H30  and  Ga3As3H12  for  tol  =  10-4.  Even  when  it  converges, 
on  larger  problems  ARPACK  typically  takes  longer  to  run  than  either  EigPen  or  LOBPCG  (which  may  be 
attributable  to  a  limited  parallel  scalability  in  sparse  matrix-vector  multiplications).  In  general,  EigPen  is 
faster  that  LOBPCG,  especially  on  larger  problems,  with  only  one  significant  exception  on  the  matrix  c_65 
for  tol  =  10-2. 


Table  2:  A  comparison  of  total  wall  clock  time  (“ — ”  arc  abnormal  terminations) 


tol  =  10~^ 

tol  =  10-4 

Matrix 

nev 

ARPACK 

LOBPCG 

EigPen 

ARPACK 

LOBPCG 

EigPen 

Andrews 

600 

2956 

575 

159 

3344 

1160 

496 

C60 

200 

57 

29 

19 

59 

52 

44 

c_65 

500 

-- 

331 

3099 

-- 

r- 

10030 

cfdl 

700 

-- 

815 

233 

-- 

2883 

1547 

finance 

700 

9903 

968 

472 

16120 

4629 

1122 

Gal0Asl0H30 

1000 

-- 

5390 

1848 

-- 

-- 

5531 

Ga3As3H12 

600 

4871 

111 

563 

6587 

-- 

1600 

OPF3754 

200 

23 

8 

10 

23 

28 

17 

shallow  _waterl 

800 

2528 

642 

215 

18590 

3849 

951 

SilOH16 

200 

73 

77 

24 

78 

too 

86 

Si5H12 

200 

103 

86 

24 

114 

114 

38 

SiO 

400 

789 

265 

81 

840 

1534 

287 

wathenlOO 

300 

828 

219 

89 

869 

1103 

219 

The  minimal,  average  and  maximal  number  of  the  RR  steps  performed  by  EigPen  in  this  set  of  tests  is, 
respectively,  3,  5  and  9.  As  will  be  shown  later,  the  cost  of  the  RR  steps  only  accounts  for  a  very  small 
portion  of  the  total  cost  of  EigPen. 

We  next  show  the  accuracy  of  the  computed  eigenpairs,  as  well  as  that  of  the  computed  minimum  trace 
values.  We  should  point  out  that  when  tol  is  relatively  large,  the  7-th  Ritz  value  6,  may  be  closer  to  \j  for 
j  >  i  than  to  At.  In  this  case,  we  may  miss  some  eigenvalues  even  though  the  convergence  criterion  (47)  is 
satisfied  for  all  i  <  nev.  To  measure  the  accuracy  of  the  computation,  we  compute  the  relative  difference 
between  0,  and  the  true  eigenvalue  A,  computed  in  advance  by  ScaLAPACK  [4].  The  maximum  relative 
errors  among  all  eigenvalues,  which  is  measured  by 


err  g  = 


max 

i=l,...,nev 


I Oi  —  Ai| 
max(l,  |Ai|)  ’ 


are  reported  in  Table  3,  and  the  relative  errors  between  the  sum  of  the  nev  eigenvalues,  defined  by 


errtrace 


Enev  n  sr^nev  \  \ 

i=  l^i  2^=1  M 

max(l,|£r=iAt|) 


are  presented  in  Table  4.  From  these  tables,  we  see  that  LOBPCG  and  EigPen  achieve  the  same  level  of 
accuracy  on  most  problems.  Compared  with  the  other  two  solvers,  the  accuracy  of  ARPACK  is  worse  on 
most  matrices  when  tol  =  1 0  2  and  somewhat  better  on  about  half  of  the  matrices  when  tol  =  10  4. 

To  measure  the  accuracy  of  the  approximate  eigenvectors,  we  also  report  the  maximum  residual  error 
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Table  3:  A  comparison  of  err g  among  different  solvers 

Matrix 

tol  =  10  2 

tol  =  10~4 

ARPACK 

LOBPCG 

EigPen 

ARPACK 

LOBPCG 

EigPen 

Andrews 

1.51e-02 

2.78e-04 

2.31e-03 

4.54e-06 

4.58e-05 

4.58e-05 

C60 

4.48e-06 

1.78e-04 

5.88e-04 

4.48e-06 

4.34e-05 

4.34e-05 

c_65 

— 

2.43e-04 

1.52e-03 

— 

— 

4.79e-05 

cfdl 

— 

2.72e-03 

6.33e-03 

— 

5.37e-07 

3.90e-06 

finance 

5.19e-02 

1.28e-03 

4.78e-03 

7.59e-05 

4.80e-05 

4.80e-05 

Gal0Asl0H30 

— 

2.98e-04 

3.83e-04 

— 

— 

4.97e-05 

Ga3As3H12 

3.02e-02 

1.70e-04 

9.15e-04 

5.50e-03 

— 

4.69e-05 

OPF3754 

3.59e-06 

6.74e-04 

8.80e-04 

3.59e-06 

2.22e-05 

2.22e-05 

shallow  _waterl 

3.96e-01 

5.75e-03 

5.80e-03 

3.85e-04 

8.64e-06 

8.42e-06 

SH0H16 

2.83e-02 

5.01e-05 

9.77e-05 

2.53e-02 

4.33e-05 

4.33e-05 

Si5H12 

5.52e-02 

6.11e-05 

3.35e-04 

4.38e-06 

3.86e-05 

3.86e-05 

SiO 

2.40e-02 

9.14e-05 

1.42e-03 

4.43e-06 

4.81e-05 

4.81e-05 

wathenlOO 

5.27e-03 

5.74e-05 

9.11e-04 

3.93e-06 

3.17e-05 

3.17e-05 

Table  4:  A  comparison  of 

en'tracf  among  different  solvers 

Matrix 

tol  =  10~'J 

tol  =  10-4 

ARPACK 

LOBPCG 

EigPen 

ARPACK 

LOBPCG 

EigPen 

Andrews 

1.01e-03 

1.48e-05 

1 ,06e-04 

5.87e-09 

1 ,07e-07 

1.07e-07 

C60 

1.72e-07 

4.20e-06 

7.48e-06 

1.72e-07 

9.01e-08 

9.01e-08 

c_65 

— 

5.73e-06 

2.98e-05 

— 

— 

8.20e-07 

cfdl 

— 

2.36e-01 

5.37e-01 

— 

1 ,43e-06 

l.lle-05 

finance 

8.13e-03 

1.18e-04 

4.99e-04 

2.35e-07 

3.91e-07 

3.91e-07 

Gal0Asl0H30 

— 

1.33e-05 

1 .50e-05 

— 

— 

3.10e-07 

Ga3As3H12 

1.82e-03 

8.57e-06 

5.27e-05 

6.38e-05 

— 

1.01e-06 

OPF3754 

3.47e-09 

4.87e-05 

1 ,77e-05 

3.47e-09 

8.84e-08 

8.84e-08 

shallow  _waterl 

1.30e-01 

2.38e-03 

1 ,77e-03 

2.03e-05 

1 ,70e-07 

1.49e-07 

Sil0H16 

2.97e-03 

7.60e-06 

1 ,97e-06 

1.91e-03 

3.16e-06 

3.16e-06 

Si5H12 

1.58e-03 

7.62e-06 

1.88e-05 

2.12e-07 

5.50e-07 

5.50e-07 

SiO 

7.61e-04 

6.01e-06 

3.89e-05 

1.72e-07 

1 ,29e-06 

1.29e-06 

wathenlOO 

2.66e-04 

1.18e-05 

2.43e-05 

6.94e-08 

3.05e-07 

3.05e-07 

defined  by 


errres  =  max  res^ 

i=l,...,nev 

in  Table  5.  We  observe  that  EigPen  ususally  returns  a  slightly  smaller  residual  error  than  LOBPCG  in  this 
set  of  tests. 


4.4  Performance  profile  and  dependency  on  eigenspace  dimension 

In  this  subsection,  we  examine  how  LOBPCG  and  EigPen  perform  when  the  number  of  desired  eigenpairs 
(nev)  increases.  For  brevity,  we  only  show  results  for  two  matrices  Andrews  and  Ga3As3H12,  but  similar 
profiles  can  be  observed  for  other  matrices  as  well.  We  exclude  ARPACK  from  this  comparison  because 
ARPACK  is  not  as  competitive  as  either  LOBPCG  or  EigPen  for  relatively  large  values  of  nev,  as  is  shown 
by  the  results  of  the  previous  subsection. 

In  the  following  experiments,  we  set  the  convergence  tolerance  to  tol  =  10-2  and  nev  to  a  set  of  in¬ 
creasing  values  {500, 1000, 1500,  2000,  2500,  3000}.  We  run  both  solvers  on  24  cores.  The  wall  clock  time 
measurements  are  plotted  against  nev  for  both  solvers  in  Figure  1 .  We  observe  that  in  this  test  EigPen  always 
takes  a  less  amount  of  time  to  run  than  LOBPCG  does.  The  difference  in  wall  clock  time  increases  quickly 
as  nev  increases.  This  observation  suggests  that  the  benefit  of  using  EigPen  become  increasingly  greater 


23 


Table  5 :  A  comparison  of  err,Y,,s  among  different  solvers 


Matrix 

tol  =  lO”2 

tol  =  10-4 

ARPACK 

LOBPCG 

EigPen 

ARPACK 

LOBPCG 

EigPen 

Andrews 

9.10e-03 

9.58e-03 

9.30e-03 

3.96e-05 

9.20e-05 

3.14e-05 

C60 

1.77e-04 

9.83e-03 

5.23e-03 

8.67e-07 

9.31e-05 

7.59e-05 

c_65 

-- 

9.60e-03 

9.72e-03 

-- 

-- 

7.22e-05 

cfdl 

-- 

9.50e-03 

6.22e-03 

-- 

9.80e-05 

5.84e-05 

finance 

9.43e-03 

9.94e-03 

8.23e-03 

5.86e-05 

9.96e-05 

7.29e-05 

Gal0Asl0H30 

-- 

9.97e-03 

5.99e-03 

-- 

-- 

2.72e-05 

Ga3As3H12 

7.92e-03 

8.61e-03 

8.79e-03 

7.43e-05 

-- 

3.73e-05 

OPF3754 

1.19e-04 

9.21e-03 

5.34e-03 

8.21e-05 

4.96e-05 

7.55e-05 

shallow  .water  1 

1.00e-02 

9.90e-03 

6.70e-03 

8.90e-05 

9.61e-05 

3.96e-05 

SilOH16 

2.44e-03 

8.90e-03 

3.44e-03 

1.45e-05 

9.01e-05 

6.54e-05 

Si5H12 

1.99e-03 

9.56e-03 

6.51e-03 

4.56e-05 

9.78e-05 

8.98e-05 

SiO 

1.37e-03 

1.00e-02 

9.11e-03 

3.28e-06 

9.46e-05 

1.03e-05 

wathenlOO 

9.96e-03 

9.60e-03 

6.22e-03 

1.13e-05 

7.72e-05 

9.95e-05 

as  nev  becomes  larger.  The  key  reason  that  EigPen  performs  much  better  than  LOBPCG  for  large  nev  is 
that,  by  performing  far  fewer  RR  steps,  it  is  able  to  leverage  BLAS3  operations  that  are  highly  optimized 
for  Hopper  (and  other  high  performance  computers). 


(a)  Andrews  (b)  Ga3As3H12 

Figure  1 :  A  comparison  of  wall  clock  times  by  LOBPCG  and  EigPen  to  compute  nev  eigenpairs  of  the 
matrices  Andrews  and  Ga3As3H12  as  nev  increases. 

In  Figure  2,  we  show  run  times  of  four  categories:  sparse  matrix  vector  multiplications  (SpMV),  dense 
matrix-matrix  operations  (BLAS3,  the  DGEMM,  DSYRK,  DPOTRF ,  DTRSM  subroutines  in  BLAS  and  LA- 
PACK),  Rayleigh-Ritz  (RR,  the  DSYGVD,  DSYEVD ,  DGESVD  subroutines  in  LAPACK)  calculations,  and 
matrix  copying  (the  DLACPY  subroutine  in  LAPACK).  These  are  the  major  computational  components  of 
both  EigPen  and  LOBPCG,  albeit  in  different  proportions.  Two  clarifications  are  in  order  here.  Firstly,  we 
categorize  these  subroutines  only  at  the  highest  solver  level.  As  such,  any  call  to  DGEMM  inside  the  subrou¬ 
tine  DSYEVD,  for  example,  is  not  counted  as  in  the  BLAS3  category.  Secondly,  although  the  “correctness” 
of  such  a  classification  scheme  may  be  debatable,  it  does  not  at  alter  the  overall  fact,  as  is  clearly  shown  by 
our  computational  results,  that  the  category  BLAS3  is  much  more  scalable  than  the  category  RR  on  our  test 
platform. 

The  run  time  of  each  category  is  measured  in  terms  of  the  percentage  of  wall  clock  time  spent  in  that 
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category  over  the  total  wall  clock  time.  We  can  clearly  see  that  for  EigPen  the  run  time  of  BLAS3  dominates 
the  entire  computation  in  almost  all  cases.  The  BLAS3  time  increases  steadily  as  nev  increases  from  500  to 
3000,  while  the  SpMV  time  decreases  steadily.  The  run  time  of  RR  is  negligible.  However,  since  our  imple¬ 
mentation  of  EigPen  performs  extra  matrix  copying  when  computing  the  gradient  difference  YJ  defined  in 
(31)  for  the  BB  step  size  computation,  the  cost  associated  with  such  data  movement  is  notable.  In  LOBPCG, 
the  relative  cost  of  SpMV  also  decreases  as  nev  increase.  However,  the  run  time  of  RR  increases  rapidly  as 
nev  increases.  When  nev  >  1500,  the  run  time  of  RR  is  higher  than  that  of  BLAS3.  Note  that  the  RR  time 
for  LOBPCG  seems  out  of  proportion  when  nev  is  equal  to  1000  and  1500  for  the  matrix  Ga3As3H12.  The 
reason  is  that  we  have  to  perform  a  few  singular  value  decompositions  (SVD)  to  repair  a  rank  deficient  basis 
in  LOBPCG  and  the  cost  of  SVD  is  counted  in  RR.  The  rank  deficiency  is  caused  by  eigenvalues  clusters 
nears  the  1000th  and  the  1500th  eigenvalues. 


500  1000  1500  2000  2500  3000 

nev 


(a)  Andrews:  EigPen 


(bj  Andrews:  LOBPCG 


nev 


(c)  Ga3As3H12:  EigPen 


(d)  Ga3As3H12:  LOBPCG 


Figure  2:  A  comparison  of  timing  profile  between  EigPen  and  LOBPCG. 


Figures  1  and  2  clearly  demonstrate  that  the  advantages  of  the  EigPen  algorithm  are  due  to  fewer 
Rayleigh-Ritz  calculations.  This  advantage  is  more  pronounced  when  the  number  of  eigenpairs  to  be  com¬ 
puted  (nev)  is  large  because  the  cost  of  Rayleigh-Ritz  calculation  grows  rapidly  with  respect  to  nev  (and 
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k).  Although  the  complexity  of  the  Rayleigh-Ritz  calculation  is  the  same  as  that  associated  with  the  dense 
matrix-matrix  operations  required  for  updating  the  approximate  solution  in  EigPen,  dense  matrix-matrix  op¬ 
erations  can  be  implemented  efficiently  on  modern  high  performance  parallel  computers  whereas  it  is  more 
difficult  to  achieve  the  same  level  of  efficiency  for  RR  calculations.  As  a  result,  by  keeping  the  number  of 
Rayleigh-Ritz  calcuations  small  in  EigPen  and  making  use  of  more  BLAS3  operations,  we  can  make  it  more 
efficient  than  LOBPCG  for  large  nev  values. 

4.5  Parallel  scalability 

In  this  subsection,  we  examine  parallel  scalability  of  LOBPCG  and  EigPen.  For  brevity,  we  again  only 
show  results  for  the  Andrews  and  Ga3As3H12  matrices,  although  similar  results  can  be  seen  for  other  test 
problems  as  well.  We  define  the  speedup  factor  for  running  a  code  on  p  cores  as 

,  wall  clock  time  for  a  single  core  run 

speedup-factor  (p)  = - . 

wall  clock  time  for  a  p-core  run 

We  set  nev  =  1500  and  only  run  5  iterations  for  each  solver  since  the  speedup  factor  does  not  change  if 
more  iterations  are  performed. 

Figure  3  shows  the  speedup  factors  associated  with  SpMV,  BLAS3,  RR  and  DLACPY,  as  well  as  the 
overall  computation,  when  the  parallelized  Fortran  codes  arc  run  with  2,  4,  8,  16  and  24  cores.  As  we  can 
clearly  see  from  the  figure  that  the  speedup  factors  for  BLAS3  arc  nearly  perfect  when  these  codes  arc  run 
on  as  many  as  24  cores.  The  scalability  of  SpMV  is  almost  as  good  for  the  Ga3As3H12  problem.  But  it  is 
slightly  worse  beyond  8  cores  for  the  Andrews  matrix,  which  we  believe  is  due  to  a  higher  sparsity  of  the 
Andrews  matrix  that  makes  the  effect  of  thread  overhead  more  prominent  in  parallel  SpMV  calculations. 
However,  the  speedup  factor  for  RR  increases  slowly  with  respect  to  the  number  of  cores  up  to  8  cores,  then 
it  starts  to  decrease.  Because  computation  in  EigPen  is  heavily  dominated  by  BLAS3  (and  to  a  lesser  extent 
by  SpMV)  that  scales  much  better  than  RR,  the  overall  scalability  of  EigPen  is  better  than  that  of  LOBPCG. 

4.6  Preconditionining  for  EigPen 

One  can  also  introduce  a  preconditioner  in  EigPen  similar  to  LOBPCG.  The  use  of  a  preconditioner  essen¬ 
tially  amounts  to  a  change  of  variable  in  the  form  of  Y  =  LX.  If  an  appropriate  nonsingular  L  is  chosen, 
substituting  X  =  L~XY  into  (2)  yields  a  problem  with  a  better  conditioned  Hessian.  Let  M  =  LT L  be  a 
preconditioner.  It  is  not  difficult  to  show  that  a  preconditioned  gradient  method  can  be  described  by 

Xj+1  =  Xj  -  ajM~lXf^Xj).  (48) 

We  now  demonstrate  that  the  performance  of  EigPen  can  be  improved  by  preconditioning  using  the 
matrix  “c_65”  as  an  example.  Recall  from  Table  2  this  problem  is  relatively  difficult  to  be  solved  by 
EigPen  without  using  a  preconditioner.  The  following  experiment  is  performed  in  MATLAB  on  a  Dell 
Precision  M4700  workstation  with  Intel  i7-3720QM  CPU  at  2.60GHz  (x8)  and  16GB  of  memory  running 
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(a)  Andrews:  EigPen 


(b)  Andrews:  LOBPCG 


(c)  Ga3As3H12:  EigPen 


(d)  Ga3As3H12:  LOBPCG 


Figure  3:  A  comparison  of  timing  profile  between  LOBPCG  and  EigPen. 


Ubuntu  12.04  and  MATLAB  201  lb.  We  choose  to  use  MATLAB  because  it  is  relatively  easy  to  construct  a 
preconditioner  and  perform  M~  1 SJ  fjt  (XJ)  in  MATLAB. 

We  computed  nev  =  500  eigenpairs  with  the  tolerance  tol  =  10“  2  by  using  the  same  RR  restart 
Algorithm  2  as  in  subsection  4.3.  The  preconditioner  is  computed  by  performing  an  incomplete  Cholesky 
factorization  of  the  matrix  “c_65”  using  the  command 

L  =  ichol (A, opts )  , 

where  opts  =  struct  ( '  type'  ,  '  ict'  ,  '  droptol'  ,  le-2 ,  '  michol'  ,  'on'  ).  Without  precon¬ 
ditioning,  EigPen  consumed  4700  seconds  and  performed  a  total  number  of  3389  gradient  evaluations.  With 
the  preconditioner,  the  wall  clock  time  went  down  to  2339  seconds  and  the  total  number  of  gradient  evalua¬ 
tions  decreased  to  1375. 

In  order  to  see  the  effect  of  preconditioning  clearly,  we  show  the  iteration  history  of  the  gradient  norm 
{||  V//,(AfJ)||/7}  computed  by  Algorithm  1  using  the  unpreconditioned  scheme  (27)  and  the  preconditioned 
scheme  (48),  respectively.  The  results  are  depicted  in  Figure  4.  It  is  clear  that  preconditioning  can  improve 
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the  performance  of  EigPen  significantly. 


(a)  without  preconditioning  (b)  with  preconditioning 

Figure  4:  The  gradient  norm  ||V/At(X?)||j?  versus  iteration  on  c_65  matrix  without/with  preconditioning. 

We  should  point  out  that  it  is  generally  not  easy  to  identify  an  effective  and  efficient  preconditioner  for  an 
eigenvalue  problem.  An  ideal  preconditioner  should  be  close  to  A ~1  so  that  the  first  term  of  the  Hessian  (6) 
associated  with  the  transformed  problem  becomes  well  conditioned.  Yet,  the  pre-conditioner  itself  should 
not  be  too  ill-conditioned;  otherwise,  the  subsequent  terms  in  (6)  may  become  ill-conditioned,  possibly 
making  the  entire  Hessian  ill-conditioned.  The  purpose  of  presenting  the  above  example  is  not  to  promote 
a  particular  pre -conditioner,  but  rather  to  demonstrate  the  fact  that  EigPen  can  indeed  take  advantage  of  a 
good  pre-conditioner  whenever  it  is  available. 

5  Conclusion 

We  propose  and  study  a  quadratic  penalty  approach  for  large-scale  eigenspace  computation.  Our  analysis 
on  stationarity  of  the  penalty  function  establishes,  under  suitable  conditions,  not  only  an  equivalence  to 
the  original  eigenvalue  problem  in  terms  of  the  optimal  invariant  subspace,  but  also  a  desirable  property 
of  having  fewer  saddle  points.  It  appears  that  our  analysis  and  usage  of  quadratic  penalty  functions  for 
eigenspace  computation  is  new.  A  gradient-type  method  is  then  developed  in  which  the  number  of  Rayleigh- 
Ritz  (RR)  steps,  including  orthogonalization,  is  greatly  reduced  in  exchange  for  BLAS3-rich  operations. 
Hence,  our  method  has  the  potential  to  take  advantage  of  tens  or  hundreds  of  thousands-way  concurrency 
on  modern  multi/many-core  systems. 

Numerical  experiments,  based  on  a  Fortran  implementation  using  OpenMP,  are  conducted  on  a  parallel 
computer  to  evaluate  the  performance  of  our  new  eigensolver  EigPen  in  comparison  to  a  few  state-of-the- 
art  solvers  such  as  LOBPCG.  In  our  numerical  experiments  (where  preconditioning  is  not  used  to  avoid 
complications),  EigPen  generally  outperforms  LOBPCG  in  wall  clock  time,  especially  as  the  dimension 
of  the  computed  eigenspace  increases.  This  trend  in  favor  of  EigPen  can  be  explained  as  follows.  As  the 
eigenspace  dimension  increases,  the  computation  in  EigPen  is  increasingly  dominated  by  dense  matrix- 
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matrix  operations,  but  in  LOBPCG  it  is  increasingly  dominated  by  the  Rayleigh-Ritz  procedure  including 
orthogonalization.  Clearly,  the  former  is  more  scalable  than  the  latter  on  modern  high  performance  comput¬ 
ers. 

The  performance  of  EigPen  can  certainly  be  improved  in  several  aspects,  such  as  speeding  up  conver¬ 
gence  or  improving  accuracy,  with  the  help  of  a  number  of  techniques  or  of  Hessian  information.  Candidate 
techniques  may  include  subspace  optimization,  preconditioning,  deflation  and  polynimail  filtering  (in  fact 
we  have  tried  some  of  them  but  did  not  include  them  in  our  numerical  comparison  for  simplicity).  A  verifica¬ 
tion  of  the  parallel  efficiency  of  EigPen  using  the  Message  Passing  Interface  (MPI)  is  also  highly  expected. 
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